feat(html): add `normalize` function for HTML entities (#4523) #4524

lionel-rowe · 2024-03-26T11:26:47Z

Various bikeshedding things:

- Given its dual use for XML, should the top-level dir be renamed from `html`? If so, to what?
  - `xml` suffers the same issue but in reverse
  - `html_and_xml` is pretty gross
  - `markup` seems overly vague
  - `sgml_like` might be technically correct, but would the average web developer think to look there?
- Is it worth exporting `escapeAllCharsAsHex`? It could be useful for users wanting finer-grained control than the two normalization forms, but it's a very simple function to replicate.
- Is the name `NormalizationForm` OK, or is it too easily confused with Unicode normalization forms (NFC, NFD, etc.)?
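
To give a sense of how little code the `escapeAllCharsAsHex` question involves, here is a minimal sketch of what such a helper might look like. The body is an assumption written for illustration, not the implementation from this PR; only the name and the lowercase `&#x…;` output format (as in the doc examples under review) come from the thread.

```typescript
// Hypothetical re-implementation for illustration; the PR's actual
// escapeAllCharsAsHex body is not shown in this thread.
function escapeAllCharsAsHex(str: string): string {
  // Array.from iterates by code point, so astral characters (surrogate
  // pairs) become a single entity rather than two.
  return Array.from(
    str,
    (ch) => `&#x${ch.codePointAt(0)!.toString(16)};`,
  ).join("");
}

console.log(escapeAllCharsAsHex("þð")); // "&#xfe;&#xf0;"
```

Iterating by code point rather than by UTF-16 code unit is the one detail that is easy to get wrong when replicating this by hand.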

kt3k · 2024-04-16T15:19:41Z

html/entities.ts

+function escapeXmlRestricted(str: string) {
+  return str.replaceAll(
+    // deno-lint-ignore no-control-regex
+    /[^\x09\x0a\x0d\x20-\x7e\x85\xa0-\ud7ff\ue000-\ufdcf\ufdf0-\ufffd\u{10000}-\u{1fffd}\u{20000}-\u{2fffd}\u{30000}-\u{3fffd}\u{40000}-\u{4fffd}\u{50000}-\u{5fffd}\u{60000}-\u{6fffd}\u{70000}-\u{7fffd}\u{80000}-\u{8fffd}\u{90000}-\u{9fffd}\u{a0000}-\u{afffd}\u{b0000}-\u{bfffd}\u{c0000}-\u{cfffd}\u{d0000}-\u{dfffd}\u{e0000}-\u{efffd}\u{f0000}-\u{ffffd}\u{100000}-\u{10fffd}]+/gu,


Why do we start escaping these chars by default?

Also, does HTML have the same concept of restricted chars as XML?

lionel-rowe · 2024-04-23

Upon reflection, this PR needs some more thought. I'd initially thought a sort of "baseline-compatible with both HTML and XML" would be a sensible default, given that these characters are rare in practice yet could cause problems in XML. But it turns out that entities for certain C1 control-character code points have different semantics in HTML than in XML. Take `&#x80;`, for example:

['application/xml', 'text/html'].map((contentType) => {
  const { literal, entity } = JSON.parse(
    new DOMParser().parseFromString(
      '<div>{"literal": "\x80", "entity": "&#x80;"}</div>',
      contentType,
    ).querySelector('div').textContent,
  )
  return { contentType, literal, entity }
})
// { "contentType": "application/xml", "literal": "\x80", "entity": "\x80" }
// { "contentType": "text/html", "literal": "\x80", "entity": "€" }

Also, I'm not sure converting those characters to entities is the right approach. Probably simply stripping them out would be more sensible in most cases.
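
As a rough illustration of the stripping alternative, the sketch below removes every code point that falls outside the XML 1.0 `Char` production. This is an assumption made for the example: the function name `stripXmlInvalidChars` is hypothetical, and the character class is deliberately narrower than the one in the PR's `escapeXmlRestricted`, which also excludes most C1 controls and noncharacters.

```typescript
// XML 1.0 Char production:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// The negated class matches anything outside it, including lone surrogates.
// deno-lint-ignore no-control-regex
const XML_INVALID_CHARS =
  /[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\u{10000}-\u{10ffff}]/gu;

// Hypothetical helper: drops disallowed characters instead of escaping them.
function stripXmlInvalidChars(str: string): string {
  return str.replace(XML_INVALID_CHARS, "");
}

console.log(stripXmlInvalidChars("a\x00b\x0bc")); // "abc"
```

Stripping sidesteps the HTML/XML divergence shown above, because no numeric entity is emitted for either parser to reinterpret.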

kt3k · 2024-04-16T15:20:58Z

html/entities.ts

@@ -34,15 +38,23 @@ const rawRe = new RegExp(`[${[...rawToEntity.keys()].join("")}]`, "g");
 * // Characters that don't need to be escaped will be left alone,
 * // even if named HTML entities exist for them.
 * escape("þð"); // "þð"
+ * // You can force non-ASCII chars to be escaped by setting the `form` option to `compatibility`:
+ * escape("þð", { form: "compatibility" }); // "&#xfe;&#xf0;"


form: "compatibility" sounds confusing to me as I don't see what it's compatible with.

kt3k · 2024-04-16T15:24:11Z

html/entities.ts

+ *
+ * // specifying a `form` option (default is `readability`):
+ * normalize("两只小蜜蜂", { form: "readability" }); // "两只小蜜蜂"
+ * normalize("两只小蜜蜂", { form: "compatibility" }); // "&#x4e24;&#x53ea;&#x5c0f;&#x871c;&#x8702;"


form: "compatibility" feels like it defeats the purpose of normalize to me. How about not having this option for now?

iuioiua · 2024-05-06T07:03:28Z

@lionel-rowe, to not keep stale PRs open, are you happy for us to close this PR for now? You can re-open the existing PR or a new one once it is ready.

lionel-rowe · 2024-05-06T07:17:51Z

> @lionel-rowe, to not keep stale PRs open, are you happy for us to close this PR for now? You can re-open the existing PR or a new one once it is ready.

Sure, I'll close now.

feat(html): add normalize function for HTML entities (denoland#4523)

f726a8d

lionel-rowe requested a review from kt3k as a code owner March 26, 2024 11:26

github-actions bot added the html label Mar 26, 2024

lionel-rowe added 2 commits March 26, 2024 20:05

Always escape XML-restricted chars

ac41a27

Handle default options consistently

f15c4fa

kt3k reviewed Apr 16, 2024

View reviewed changes

lionel-rowe marked this pull request as draft April 23, 2024 05:59

lionel-rowe closed this May 6, 2024

